fix(clustering/rpc): sync retry timeout due to block #14195

Merged on Jan 21, 2025 (2 commits)

StarlightIbuki
Contributor

Summary

A timeout should not be counted as a failed trial.

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

KAG-6224

@github-actions bot added the core/clustering and cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee) labels on Jan 20, 2025
@StarlightIbuki force-pushed the fix/timeout-retry branch 2 times, most recently from 1b010c6 to 8c14f00, on January 20, 2025 07:59
@StarlightIbuki force-pushed the fix/timeout-retry branch 2 times, most recently from 7c4fcb6 to b89f0e5, on January 20, 2025 09:13
Contributor

@ADD-SP left a comment

    kong/clustering/services/sync/rpc.lua:380:16: unused function start_sync_once_timer

Could you fix the linting?

@StarlightIbuki
Contributor Author

    kong/clustering/services/sync/rpc.lua:380:16: unused function start_sync_once_timer

Could you fix the linting?

done

@StarlightIbuki force-pushed the fix/timeout-retry branch 3 times, most recently from abd0016 to 1ee8325, on January 20, 2025 09:52
@pull-request-size bot added the size/M label and removed the size/S label on Jan 20, 2025
@StarlightIbuki force-pushed the fix/timeout-retry branch 2 times, most recently from 6570804 to c43194d, on January 20, 2025 10:37
timeout should not be considered a failure trial

KAG-6224
@chronolaw changed the title from "fix(rpc): sync retry timeout due to block" to "fix(clustering/rpc): sync retry timeout due to block" on Jan 21, 2025
if retry_count > MAX_RETRY then
  ngx_log(ngx_ERR, "sync_once retry count exceeded. retry_count: ", retry_count)
  return
end

return start_sync_once_timer(retry_count + 1)
-- we do not count a timed out sync. just retry
if err ~= "timeout" then
Contributor

@chronolaw commented on Jan 21, 2025

The timeout option is now 0:

local SYNC_MUTEX_OPTS = { name = "get_delta", timeout = 0, }

So how can the timeout error occur? Or will it always be a timeout?

Contributor Author

@chronolaw This is a locked operation, where timeout = 0 means it fails instantly if the lock cannot be acquired. By tolerating the timeout in this case, we allow sync_once to actually retry instead of giving up when another coroutine holds the lock.
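
The behaviour described above can be illustrated with a minimal sketch in plain Lua. It is not Kong's implementation: `with_mutex` and the `held` table are hypothetical stand-ins for a named mutex, and only `SYNC_MUTEX_OPTS` is taken from the thread. The sketch assumes that a lock which cannot be acquired with `timeout = 0` is reported as `nil, "timeout"`, while a failing callback reports its own, different error.

```lua
-- Hypothetical stand-in for a named mutex (not Kong's actual module).
local held = {}  -- lock name -> true while the lock is held

-- Runs fn while holding the lock named opts.name.
-- With opts.timeout == 0 it never waits: if the lock is already taken,
-- it returns nil, "timeout" immediately.
local function with_mutex(opts, fn)
  if held[opts.name] then
    if opts.timeout == 0 then
      return nil, "timeout"
    end
    error("waiting for the lock is not modeled in this sketch")
  end

  held[opts.name] = true
  local ok, res, err = pcall(fn)
  held[opts.name] = nil

  if not ok then
    return nil, res  -- the callback raised: a real error, not a timeout
  end
  return res, err
end

local SYNC_MUTEX_OPTS = { name = "get_delta", timeout = 0 }

-- Uncontended: the callback runs and its result is returned.
local res1, err1 = with_mutex(SYNC_MUTEX_OPTS, function() return true end)

-- Contended: the nested call stands in for a second coroutine that tries
-- to take the lock while the first one still holds it.
local res2, err2
with_mutex(SYNC_MUTEX_OPTS, function()
  res2, err2 = with_mutex(SYNC_MUTEX_OPTS, function() return true end)
  return true
end)

-- Callback failure: the error is the callback's own, not "timeout".
local res3, err3 = with_mutex(SYNC_MUTEX_OPTS, function()
  return nil, "rpc failed"
end)
```

In this sketch only the contended call yields `"timeout"`, which is the case the retry logic is meant to tolerate.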

Contributor

Do you mean that it will always get a timeout error?

Contributor Author

@chronolaw No. When the callback fails or errors out, it emits a different error. That is what we count as a real error.

Contributor Author

That is, if there is no error but the version still does not match, we count it as one trial; if it is a real failure (internal or external), we also count it as one trial. This prevents it from looping forever. At the same time, we do not want a running coroutine to block the sync_once call and make it quickly exhaust all its chances, so we do not count timeouts (lock failures).
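
The counting rule described here can be sketched as a simple loop in plain Lua. This is a simulation under stated assumptions, not the code from the PR: the real change schedules retries with timers, whereas `run_sync`, `attempt`, and the `max_calls` safety cap are hypothetical names introduced for illustration, and the value of `MAX_RETRY` is assumed.

```lua
local MAX_RETRY = 5  -- assumed value, for illustration only

-- attempt(n) stands in for one sync attempt; it returns true on success,
-- or a falsy value plus an optional error string.
-- max_calls is a safety cap that exists only in this simulation.
local function run_sync(attempt, max_calls)
  local retry_count, calls = 0, 0

  while calls < max_calls do
    if retry_count > MAX_RETRY then
      return nil, "retry count exceeded", calls
    end

    calls = calls + 1
    local ok, err = attempt(calls)
    if ok then
      return true, nil, calls
    end

    -- A timed out sync (lock not acquired) is not counted; just retry.
    -- Real errors, and attempts that fail without an error, are counted.
    if err ~= "timeout" then
      retry_count = retry_count + 1
    end
  end

  return nil, "call cap reached", calls
end

-- Ten lock timeouts in a row do not use up any trials.
local ok_a, err_a, calls_a = run_sync(function(n)
  if n <= 10 then
    return nil, "timeout"
  end
  return true
end, 100)

-- Real failures are counted, so the loop gives up after MAX_RETRY + 1 calls.
local ok_b, err_b, calls_b = run_sync(function()
  return nil, "rpc failed"
end, 100)
```

Under these assumptions the first run succeeds on its eleventh call, while the second stops with "retry count exceeded" once the counted failures pass `MAX_RETRY`.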

@ADD-SP merged commit dfc955e into master on Jan 21, 2025
28 checks passed
@ADD-SP deleted the fix/timeout-retry branch on January 21, 2025 07:47
@ADD-SP added the incomplete-cherry-pick label (a cherry-pick was incomplete and needs manual intervention) on Jan 21, 2025
@chronolaw
Contributor

@StarlightIbuki do not forget to cherry-pick.

@ADD-SP removed the incomplete-cherry-pick label (a cherry-pick was incomplete and needs manual intervention) on Jan 22, 2025
Labels
cherry-pick kong-ee (schedule this PR for cherry-picking to kong/kong-ee), core/clustering, size/M, skip-changelog